Measure representation and multifractal analysis of complete genomes 



Zu-Guo Yu 1,2 *, Vo Anh 1 and Ka-Sing Lau 3 
1 Centre in Statistical Science and Industrial Mathematics, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia.
2 Department of Mathematics, Xiangtan University, Hunan 411105, P. R. China.
3 Department of Mathematics, Chinese University of Hong Kong, Shatin, Hong Kong



Abstract. This paper introduces the notion of measure representation of DNA sequences.
Spectral analysis and multifractal analysis are then performed on the measure
representations of a large number of complete genomes. The main aim of this paper is to
discuss the multifractal property of the measure representation and the classification of
bacteria. From the measure representations and the values of the $D_q$ spectra and
related $C_q$ curves, it is concluded that these complete genomes are not random
sequences. In fact, the spectral analyses indicate that these measure representations,
considered as time series, exhibit strong long-range correlation. Here the long-range
correlation is for the $K$-strings with the dictionary ordering, and it is different from
the base pair correlations introduced by other authors. For substrings with length
$K = 8$, the $D_q$ spectra of all organisms studied are multifractal-like and sufficiently
smooth for the $C_q$ curves to be meaningful. With decreasing $K$, the multifractality
lessens. The $C_q$ curves of all bacteria resemble a classical phase transition at a
critical point. But the 'analogous' phase transitions of chromosomes of non-bacteria
organisms are different. Apart from Chromosome 1 of C. elegans, they exhibit the shape of
a double-peaked specific heat function. A classification of genomes of bacteria is given
by assigning to each sequence a point in the two-dimensional space $(D_{-1}, D_1)$ and in
the three-dimensional space $(D_{-1}, D_1, D_{-2})$. Bacteria that are close
phylogenetically are mostly close in the spaces $(D_{-1}, D_1)$ and
$(D_{-1}, D_1, D_{-2})$.

PACS numbers: 87.10+e, 47.53+n 

Key words: Measure representation, spectral analysis, multifractal analysis, dimension
spectrum, 'analogous' specific heat.



I. INTRODUCTION 

DNA sequences are of fundamental importance in understanding living organisms, since all
information on heredity and species evolution is contained in these macromolecules. A DNA
sequence is formed by four different nucleotides, namely adenine (a), cytosine (c),
guanine (g) and thymine (t). A large number of these DNA sequences has become widely
available in recent times. One of the challenges of DNA sequence analysis is to determine
the patterns in these sequences. It is useful to distinguish coding from noncoding
sequences. Problems related to the classification and evolution of organisms are also
important. A significant contribution in these studies is to investigate the long-range
correlation in DNA sequences [1-16]. Li et al. found that the spectral density of a DNA
sequence containing mostly introns shows $1/f^\beta$ behaviour, which indicates the
presence of long-range correlation when $0 < \beta < 1$. The correlation properties of
coding and noncoding DNA sequences were first studied by Peng et al. in their fractal
landscape or DNA walk model. The DNA walk was defined as follows: the walker steps "up" if
a pyrimidine (c or t) occurs at position i along the DNA chain, while the walker steps
"down" if a purine (a or g) occurs at position i. Peng et al. discovered that there exists
long-range correlation in noncoding DNA sequences while the coding sequences correspond to
a regular random walk. By undertaking a more detailed analysis, Chatzidimitriou et al.
concluded that both coding and noncoding sequences exhibit long-range correlation. A
subsequent work by Prabhu and Claverie also substantially corroborates these results. If
one considers more details by distinguishing c from t in pyrimidine, and a from g in
purine (such as two or three-dimensional DNA walk models and maps given by Yu and Chen),
then the presence of base correlation has been found even in coding sequences. On the
other hand, Buldyrev et al. showed that long-range correlation appears mainly in noncoding
DNA using all the DNA sequences available. Based on equal-symbol correlation, Voss showed
a power law behaviour for the sequences studied regardless of the proportion of intron
contents. These studies add to the controversy about the possible presence of correlation
in the entire DNA or only in the noncoding DNA. From a different angle, fractal analysis
is a relatively new analytical technique that has proven useful in revealing complex
patterns in natural objects. Berthelsen et al. considered the global fractal dimensions of
human DNA sequences treated as pseudorandom walks.

In the above studies, the authors only considered short or long DNA segments. Since the
first complete genome of the free-living bacterium Mycoplasma genitalium was sequenced in
1995, an ever-growing number of complete genomes has been deposited in public databases.
The availability of complete genomes opens the possibility of establishing some global
properties of these sequences. Vieira carried out a low-frequency analysis of the complete
DNA of 13 microbial genomes and showed that their fractal behaviour does not always
prevail through the entire chain and the autocorrelation functions have a rich variety of
behaviours including the presence of anti-persistence. Yu and Wang proposed a time series
model of coding sequences in complete genomes. For fuller details on the number, size and
ordering of genes along the chromosome, one can refer to Part 5 of Lewin. One may ignore
the composition of the four kinds of bases in coding and noncoding segments and only
consider the global structure of the complete genomes or long DNA sequences. Provata and
Almirantis proposed a fractal Cantor pattern of DNA. They mapped coding segments to filled
regions and noncoding segments to empty regions of a random Cantor set and then calculated
the fractal dimension of this set. They found that the coding/noncoding partition in DNA
sequences of lower organisms is homogeneous-like, while in the higher eukaryotes the
partition is fractal. This result does not seem refined enough to distinguish bacteria
because the fractal dimensions of bacteria given by them are all the same. The
classification and evolutionary relationship of bacteria is one of the most important
problems in DNA research. Yu and Anh proposed a time series model based on the global
structure of the complete genome and considered three kinds of length sequences. After
calculating the correlation dimensions and Hurst exponents, it was found that one can get
more information from this model than from the fractal Cantor pattern. Some results on the
classification and evolutionary relationship of bacteria were found. The correlation
property of these length sequences has also been discussed.

Although statistical analysis performed directly on DNA sequences has yielded some
success, there has been some indication that this method is not powerful enough to amplify
the difference between a DNA sequence and a random sequence, or to distinguish DNA
sequences themselves in more detail. One needs more powerful global and visual methods.
For this purpose, Hao et al. proposed a visualisation method based on counting and
coarse-graining the frequency of appearance of substrings with a given length. They called
it the portrait of an organism. They found that there exist some fractal patterns in the
portraits which are induced by avoided and under-represented strings. The fractal
dimension of the limit set of portraits was also discussed. There are other graphical
methods of sequence patterns, such as chaos game representation.

In the portrait representation, Hao et al. used squares to represent substrings and
discrete colour grades to represent the frequencies of the substrings in the complete
genome. It is difficult to know the accurate value of the frequencies of the substrings
from the portrait representation. In order to improve it, in this paper we use
subintervals in one-dimensional space to represent substrings and then we can directly
obtain an accurate histogram of the substrings in the complete genome. We then view the
histogram as a measure, which we call the measure representation of the complete genome.
When the measure representation is viewed as a time series, a spectral analysis can be
carried out.

Global calculations neglect the fact that DNA sequences are highly inhomogeneous.
Multifractal analysis is a useful way to characterise the spatial inhomogeneity of both
theoretical and experimental fractal patterns. Multifractal analysis was initially
proposed to treat turbulence data. In recent years it has been applied successfully in
many different fields including time series analysis and financial modelling (see Anh et
al.). For DNA sequences, application of the multifractal technique seems rare (we have
found only Berthelsen et al.). In this paper, we pay more attention to this application.
The quantities pertaining to spectral and multifractal analyses of measures are described
in Section 3. Application of the methodology is undertaken in Section 4 on a number of
representative chromosomes. A discussion of the empirical results and some conclusions are
given in Section 5, where we also address the use of the multifractal technique in the
classification problem of bacteria.



II. MEASURE REPRESENTATION 

We call any string made of $K$ letters from the set $\{g, c, a, t\}$ a $K$-string. For a
given $K$ there are in total $4^K$ different $K$-strings. In order to count the number of
each kind of $K$-string in a given DNA sequence, $4^K$ counters are needed. We divide the
interval $[0, 1[$ into $4^K$ disjoint subintervals, and use each subinterval to represent
a counter. Letting $s = s_1 \cdots s_K$, $s_i \in \{a, c, g, t\}$, $i = 1, \cdots, K$, be
a substring with length $K$, we define

$$x_l(s) = \sum_{i=1}^{K} \frac{x_i}{4^i}, \tag{1}$$

where

$$x_i = \begin{cases} 0, & \text{if } s_i = a, \\ 1, & \text{if } s_i = c, \\ 2, & \text{if } s_i = g, \\ 3, & \text{if } s_i = t, \end{cases} \tag{2}$$

and

$$x_r(s) = x_l(s) + \frac{1}{4^K}. \tag{3}$$

We then use the subinterval $[x_l(s), x_r(s)[$ to represent substring $s$. Let $N_K(s)$ be
the number of times that substring $s$ with length $K$ appears in the complete genome. If
the number of bases in the complete genome is $L$, we define

$$F_K(s) = N_K(s)/(L - K + 1) \tag{4}$$

to be the frequency of substring $s$. It follows that $\sum_{\{s\}} F_K(s) = 1$. Now we
can define a measure $\mu_K$ on $[0, 1[$ by $d\mu_K(x) = Y_K(x)\,dx$, where

$$Y_K(x) = 4^K F_K(s), \quad \text{when } x \in [x_l(s), x_r(s)[. \tag{5}$$

It is easy to see that $\int_0^1 d\mu_K(x) = 1$ and $\mu_K([x_l(s), x_r(s)[) = F_K(s)$.
We call $\mu_K$ the measure representation of the organism corresponding to the given $K$.
As an example, the histograms of substrings in the genome of M. genitalium for
$K = 3, ..., 8$ are given in FIG. 1. Self-similarity is apparent in the measure.
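
The construction in Eqs. (1)-(5) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `left_endpoint` and `measure_representation` are hypothetical, and the sketch assumes the input contains only the letters a, c, g, t and that substrings are counted with overlapping windows, as the normalisation $L - K + 1$ in Eq. (4) suggests.

```python
from collections import Counter
from itertools import product

ALPHABET = "acgt"  # ordering a < c < g < t, as in Eq. (2)
DIGIT = {base: i for i, base in enumerate(ALPHABET)}

def left_endpoint(s):
    """x_l(s) = sum_i x_i / 4^i for a K-string s, as in Eq. (1)."""
    return sum(DIGIT[ch] / 4 ** (i + 1) for i, ch in enumerate(s))

def measure_representation(seq, K):
    """Return F_K(s) for all 4^K K-strings, listed in dictionary order."""
    seq = seq.lower()
    if set(seq) - set(ALPHABET):
        raise ValueError("sequence must contain only a, c, g, t")
    total = len(seq) - K + 1  # number of overlapping windows, L - K + 1
    counts = Counter(seq[i:i + K] for i in range(total))
    # product() over a sorted alphabet yields K-strings in dictionary order,
    # i.e. in increasing order of x_l(s).
    kstrings = ("".join(p) for p in product(ALPHABET, repeat=K))
    return [counts[s] / total for s in kstrings]
```

The returned list is the histogram itself; multiplying each entry by $4^K$ gives the density $Y_K$ of Eq. (5) on the corresponding subinterval.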

For simplicity of notation, the index $K$ is dropped in $F_K(s)$, etc., from now on, where
its meaning is clear.

Remark: The ordering of a, c, g, t in (2) gives the natural dictionary ordering of
$K$-strings in the one-dimensional space. A different ordering of $K$-strings would change
the nature of the correlations. But in our case, a different ordering of a, c, g, t in (2)
gives almost the same $D_q$ curve (and therefore almost the same $C_q$ curve), which will
be defined in the next section, when the absolute value of $q$ is relatively small. We
give FIG. 2 to support this point of view. Hence a different ordering of a, c, g, t in (2)
will not change our result. When we want to compare different bacteria using the measure
representation, once the ordering of a, c, g, t in (2) is given, it is fixed for all
bacteria.



III. SPECTRAL AND MULTIFRACTAL 
ANALYSES 

We can order all the $F(s)$ according to the increasing order of $x_l(s)$. We then obtain
a sequence of real numbers consisting of $4^K$ elements which we denote as $F(t)$,
$t = 1, \cdots, 4^K$. Viewing the sequence $\{F(t)\}_{t=1}^{4^K}$ as a time series, the
spectral analysis can then be undertaken on the sequence.

We first consider the discrete Fourier transform of the time series $F(t)$,
$t = 1, \cdots, 4^K$, defined (with $N = 4^K$) by

$$\hat{F}(f) = N^{-1/2} \sum_{t=0}^{N-1} F(t+1)\, e^{-2\pi i f t}. \tag{6}$$

Then

$$S(f) = |\hat{F}(f)|^2 \tag{7}$$

is the power spectrum of $F(t)$. In recent studies, it has been found that many natural
phenomena lead to a power spectrum of the form $1/f^\beta$. This kind of dependence was
named $1/f$ noise, in contrast to white noise $S(f) = \mathrm{const}$, i.e. $\beta = 0$.
Let the frequency $f$ take the values $f_k = k/N$, $k = 1, \cdots, N/8$. From the
$\ln(S(f))$ vs. $\ln(f)$ graph, we can infer the value of $\beta$ using the above
low-frequency range. For example, we give the log power spectrum of the measure of E. coli
with $K = 8$ in FIG. 3.
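
As a rough illustration of this step, the sketch below evaluates the power spectrum at the low frequencies $f_k = k/N$, $k = 1, \cdots, N/8$, and estimates $\beta$ as minus the least-squares slope of $\ln S(f)$ against $\ln f$. The function names are hypothetical, the normalisation $N^{-1/2}$ is assumed, and a direct (slow) Fourier sum is used for clarity, so it is only practical for small $K$. It also assumes no spectral value is exactly zero, since the logarithm would then fail.

```python
import cmath
import math

def power_spectrum(F, num_freqs=None):
    """Return (f_k, S(f_k)) for f_k = k/N, k = 1..N/8, with S = |F_hat|^2."""
    N = len(F)
    if num_freqs is None:
        num_freqs = N // 8
    freqs, power = [], []
    for k in range(1, num_freqs + 1):
        f = k / N
        # F[t] (0-based) plays the role of F(t + 1) in the text.
        F_hat = sum(F[t] * cmath.exp(-2j * math.pi * f * t) for t in range(N))
        F_hat /= math.sqrt(N)
        freqs.append(f)
        power.append(abs(F_hat) ** 2)
    return freqs, power

def estimate_beta(freqs, power):
    """Estimate beta in S(f) ~ 1/f^beta from a log-log least-squares fit."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in power]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope
```

For genome-scale input ($K = 8$, $N = 65536$) an FFT routine would be the natural replacement for the direct sum.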

The most common operative numerical implementations of multifractal analysis are the
so-called fixed-size box-counting algorithms. In the one-dimensional case, for a given
measure $\mu$ with support $E \subset \mathbb{R}$, we consider the partition sum

$$Z_\epsilon(q) = \sum_{\mu(B) \neq 0} [\mu(B)]^q, \tag{8}$$

$q \in \mathbb{R}$, where the sum runs over all different nonempty boxes $B$ of a given
side $\epsilon$ in a grid covering of the support $E$, that is,

$$B = [k\epsilon, (k+1)\epsilon[. \tag{9}$$

The exponent $\tau(q)$ is defined by

$$\tau(q) = \lim_{\epsilon \to 0} \frac{\ln Z_\epsilon(q)}{\ln \epsilon} \tag{10}$$

and the generalized fractal dimensions of the measure are defined as

$$D_q = \tau(q)/(q-1), \quad \text{for } q \neq 1, \tag{11}$$

and

$$D_q = \lim_{\epsilon \to 0} \frac{Z_{1,\epsilon}}{\ln \epsilon}, \quad \text{for } q = 1, \tag{12}$$

where $Z_{1,\epsilon} = \sum_{\mu(B) \neq 0} \mu(B) \ln \mu(B)$. The generalized fractal
dimensions are numerically estimated through a linear regression of

$$\frac{1}{q-1} \ln Z_\epsilon(q)$$

against $\ln \epsilon$ for $q \neq 1$, and similarly through a linear regression of
$Z_{1,\epsilon}$ against $\ln \epsilon$ for $q = 1$. For example, we show how to obtain
the $D_q$ spectrum using the slope of the linear regression in FIG. 4. $D_1$ is called the
information dimension and $D_2$ is called the correlation dimension. The $D_q$ of the
positive values of $q$ give relevance to the regions where the measure is large, i.e., to
the $K$-strings with high probability. The $D_q$ of the negative values of $q$ deal with
the structure and the properties of the most rarefied regions of the measure.
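
The regression procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the text does not specify which box sizes were used, so the sketch assumes $\epsilon = 4^{-k}$ for $k = 1, \cdots, K$ (each box is a union of $4^{K-k}$ adjacent subintervals), and the function names are hypothetical. At least two box sizes are needed, so $K \geq 2$.

```python
import math

def box_masses(F, cells):
    """Masses mu(B) of boxes built from `cells` adjacent subintervals."""
    return [sum(F[i:i + cells]) for i in range(0, len(F), cells)]

def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def generalized_dimension(F, K, q):
    """Estimate D_q of a measure given as 4^K cell frequencies F."""
    xs, ys = [], []
    for k in range(1, K + 1):
        # Only nonempty boxes enter the sums, as in Eq. (8).
        masses = [m for m in box_masses(F, 4 ** (K - k)) if m > 0]
        xs.append(-k * math.log(4))  # ln(eps) with eps = 4^-k
        if q == 1:
            ys.append(sum(m * math.log(m) for m in masses))
        else:
            ys.append(math.log(sum(m ** q for m in masses)) / (q - 1))
    return regression_slope(xs, ys)
```

A quick sanity check is a uniform measure, for which the estimate should be 1 for every $q$.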

Some sets of physical interest have a nonanalytic dependence of $D_q$ on $q$. Moreover,
this phenomenon has a direct analogy to the phenomenon of phase transitions in
condensed-matter physics. The existence and type of phase transitions might turn out to be
a worthwhile characterisation of universality classes for the structures. The concept of
phase transition in multifractal spectra was introduced in the study of logistic maps,
Julia sets and other simple systems. Evidence of phase transition was found in the
multifractal spectrum of diffusion-limited aggregation. By following the thermodynamic
formulation of multifractal measures, Canessa derived an expression for the 'analogous'
specific heat as

$$C_q \equiv -\frac{\partial^2 \tau(q)}{\partial q^2} \approx 2\tau(q) - \tau(q+1) - \tau(q-1). \tag{13}$$

He showed that the form of $C_q$ resembles a classical phase transition at a critical
point for financial time series. In the next section, we discuss the property of $C_q$ for
our measure representations of organisms.
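
A minimal sketch of the 'analogous' specific heat, using the finite-difference form $C_q \approx 2\tau(q) - \tau(q+1) - \tau(q-1)$, is given below. As before, the box sizes $\epsilon = 4^{-k}$, $k = 1, \cdots, K$ (with $K \geq 2$), and the function names `tau` and `specific_heat` are assumptions made for illustration only.

```python
import math

def tau(F, K, q):
    """Slope of ln Z_eps(q) against ln(eps) over eps = 4^-k, k = 1..K."""
    xs, ys = [], []
    for k in range(1, K + 1):
        cells = 4 ** (K - k)
        masses = (sum(F[i:i + cells]) for i in range(0, len(F), cells))
        xs.append(-k * math.log(4))
        # Partition sum over nonempty boxes only.
        ys.append(math.log(sum(m ** q for m in masses if m > 0)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def specific_heat(F, K, q):
    """Finite-difference estimate of C_q from tau at q - 1, q and q + 1."""
    return 2 * tau(F, K, q) - tau(F, K, q + 1) - tau(F, K, q - 1)
```

Evaluating `specific_heat` over a grid of $q$ values gives a $C_q$ curve; for a uniform measure it is identically zero.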

IV. DATA AND RESULTS 

More than 33 bacterial complete genomes are now available in public databases. There are
six Archaebacteria: Archaeoglobus fulgidus, Pyrococcus abyssi, Methanococcus jannaschii,
Pyrococcus horikoshii, Aeropyrum pernix and Methanobacterium thermoautotrophicum; five
Gram-positive Eubacteria: Mycobacterium tuberculosis, Mycoplasma pneumoniae, Mycoplasma
genitalium, Ureaplasma urealyticum, and Bacillus subtilis. The others are Gram-negative
Eubacteria, which consist of two Hyperthermophilic bacteria: Aquifex aeolicus and
Thermotoga maritima; four Chlamydia: Chlamydia trachomatis serovar, Chlamydia muridarum,
Chlamydia pneumoniae and Chlamydia pneumoniae AR39; two Spirochaete: Borrelia burgdorferi
and Treponema pallidum; one Cyanobacterium: Synechocystis sp. PCC6803; and thirteen
Proteobacteria. The thirteen Proteobacteria are divided into four subdivisions, which are
alpha subdivision: Rhizobium sp. NGR234 and Rickettsia prowazekii; gamma subdivision:
Escherichia coli, Haemophilus influenzae, Xylella fastidiosa, Vibrio cholerae, Pseudomonas
aeruginosa and Buchnera sp. APS; beta subdivision: Neisseria meningitidis MC58 and
Neisseria meningitidis Z2491; epsilon subdivision: Helicobacter pylori J99, Helicobacter
pylori 26695 and Campylobacter jejuni.

The complete sequences of some chromosomes of non-bacteria organisms are also currently
available. In order to discuss the classification problem of bacteria, we also selected
the sequences of Chromosome 15 of Saccharomyces cerevisiae, Chromosome 3 of Plasmodium
falciparum, Chromosome 1 of Caenorhabditis elegans, Chromosome 2 of Arabidopsis thaliana
and Chromosome 22 of Homo sapiens.

We obtained the dimension spectra and 'analogous' specific heat of the measure
representations of the above organisms and used them to discuss the classification
problem. We calculated the dimension spectra and 'analogous' specific heat of Chromosome
22 of Homo sapiens for $K = 1, ..., 8$, and found that the $D_q$ and $C_q$ curves of
$K = 6, 7, 8$ are very close to one another (see FIG. 5 and FIG. 6). Hence it seems
appropriate to use the measure corresponding to $K = 8$. For $K = 8$, we calculated the
dimension spectra, 'analogous' specific heat and the exponent $\beta$ of the measure
representations of all the above organisms. As an illustration, we plot the $D_q$ curves
of M. genitalium, Chromosome 15 of Saccharomyces cerevisiae, Chromosome 3 of Plasmodium
falciparum, Chromosome 2 of Arabidopsis thaliana and Chromosome 22 of Homo sapiens in
FIG. 7, and the $C_q$ curves of these organisms in FIG. 8. Because all $D_q$ are equal to
1 for the completely random sequence, it is apparent from these plots that the $D_q$ and
$C_q$ curves are nonlinear and significantly different from those of the completely random
sequence. From FIG. 8, we can claim that the curves representative of the organisms are
clearly distinct from the curve representing a random sequence. From the plot of $D_q$,
the dimension spectra of organisms exhibit a multifractal-like form. From FIG. 4, we can
see that the linear fits for $q = -2, -1, 1, 2$ are perfect and better than those for
other values of $q$. Hence we suggest using $D_{-2}, D_{-1}, D_1, D_2$ in the comparison
of different bacteria. We give the numerical results for $D_{-2}, D_{-1}, D_1, D_2$ in
Table I (from top to bottom, in increasing order of the value of $D_{-1}$).

If only a few bacteria are considered at a time, we can use the $D_q$ curve to distinguish
them. This strategy is clearly not efficient when a large number of organisms are to be
distinguished. For this purpose, we suggest using $D_{-1}$, $D_1$ and $D_{-2}$, in the
form of two-dimensional points $(D_{-1}, D_1)$ or three-dimensional points
$(D_{-1}, D_1, D_{-2})$. We give the distribution of the two-dimensional points
$(D_{-1}, D_1)$ and three-dimensional points $(D_{-1}, D_1, D_{-2})$ of bacteria in FIG. 9.

V. DISCUSSION AND CONCLUSIONS 

The idea of our measure representation is similar to the portrait method proposed by Hao
et al. It provides a simple yet powerful visualisation method to amplify the difference
between a DNA sequence and a random sequence as well as to distinguish DNA sequences
themselves in more detail. If a DNA sequence is random, then our measure representation
yields a uniform measure ($D_q = 1$, $C_q = 0$).
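
The values quoted for the uniform measure follow directly from the definitions of the partition sum, $\tau(q)$, $D_q$ and $C_q$. The short derivation below is a sketch assuming an idealised random sequence in which every $K$-string has frequency $4^{-K}$, so that each box of side $\epsilon$ (with $1/\epsilon$ an integer) carries mass $\epsilon$, and using the finite-difference form of $C_q$.

```latex
\begin{align*}
Z_\epsilon(q) &= \sum_{B} [\mu(B)]^q = \frac{1}{\epsilon}\,\epsilon^{q} = \epsilon^{q-1},
  &\tau(q) &= \lim_{\epsilon \to 0} \frac{\ln \epsilon^{q-1}}{\ln \epsilon} = q - 1, \\
D_q &= \frac{\tau(q)}{q-1} = 1 \quad (q \neq 1),
  &Z_{1,\epsilon} &= \frac{1}{\epsilon}\,\epsilon \ln \epsilon = \ln \epsilon
  \;\Rightarrow\; D_1 = 1, \\
C_q &\approx 2\tau(q) - \tau(q+1) - \tau(q-1)
  = 2(q-1) - q - (q-2) = 0.
\end{align*}
```

Any departure of the estimated $D_q$ from 1, or of $C_q$ from 0, therefore measures how far the $K$-string frequencies are from uniform.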

From the measure representation and the values of $D_q$ and $C_q$, it is seen that there
exists a clear difference between the DNA sequences of all organisms considered here and
the completely random sequence. Hence we can conclude that complete genomes are not random
sequences.

We obtained the values of the exponent $\beta$ of our measure representations
($\beta = 0.393003$ for V. cholerae, $\beta = 0.311623$ for A. pernix, $\beta = 0.240601$
for X. fastidiosa, $\beta = 0.381293$ for T. pallidum, $\beta = 0.334057$ for C.
pneumoniae AR39, and $\beta$ is larger than 0.4 for all other bacteria selected). These
values are far from 0. Hence when we view our measure representations of organisms as time
series, they are far from being random time series, and in fact exhibit strong long-range
correlation. Here the long-range correlation is for the $K$-strings with the dictionary
ordering, and it is different from the base pair correlations introduced by other authors.

Although the existence of the archaebacterial urkingdom has been accepted by many
biologists, the classification of bacteria is still a matter of controversy. The
evolutionary relationship of the three primary kingdoms, namely archaebacteria, eubacteria
and eukaryotes, is another crucial problem that remains unresolved.

When $K$ is large ($K \geq 6$), our measure representation contains rich information on
the complete genomes. From FIG. 5 and FIG. 6 we find that the curves of $D_q$ and $C_q$
are very close to one another for $K = 6, 7, 8$. Hence, for the classification problem, it
would be appropriate to take $K = 8$. We calculated the $\beta$, $D_q$ and $C_q$ values of
all organisms selected in this paper for $K = 8$. We found that the $D_q$ spectra of all
organisms are multifractal-like and sufficiently smooth so that the $C_q$ curves can be
meaningfully estimated. From FIG. 5, as $K$ decreases, the multifractality becomes less
pronounced. We found that the $C_q$ curves of all other bacteria resemble a classical
phase transition at a critical point, similar to that of M. genitalium shown in FIG. 8.
But the 'analogous' phase transitions of non-bacteria organisms are different. Apart from
Chromosome 1 of C. elegans, they exhibit the shape of a double-peaked specific heat
function, which is known to appear in the Hubbard model within the weak-to-strong coupling
regime.

It is seen that the $D_q$ curve is not clear enough to distinguish many bacteria from one
another. In order to solve this problem we use two-dimensional points $(D_{-1}, D_1)$ and
three-dimensional points $(D_{-1}, D_1, D_{-2})$. From FIG. 9, it is clear that bacteria
roughly gather into two classes (as shown in Table I). Using the distance among the
points, one can obtain a classification of bacteria.
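
As a small illustration of a distance-based comparison, the sketch below finds the nearest neighbour of a genome in the $(D_{-1}, D_1)$ plane. The coordinates are copied from Table I for five genomes only; the use of the Euclidean distance and the names `POINTS` and `nearest_neighbour` are assumptions, since the text does not specify which distance is meant.

```python
import math

# (D_{-1}, D_1) values copied from Table I for a few genomes.
POINTS = {
    "Escherichia coli": (1.047849, 0.9711645),
    "Neisseria meningitidis MC58": (1.058779, 0.9522681),
    "Neisseria meningitidis Z2491": (1.058805, 0.9497503),
    "Helicobacter pylori J99": (1.128590, 0.9299614),
    "Helicobacter pylori 26695": (1.149943, 0.9276062),
}

def nearest_neighbour(name, points):
    """Return the other genome closest to `name` in Euclidean distance."""
    here = points[name]
    return min((other for other in points if other != name),
               key=lambda other: math.dist(here, points[other]))
```

Within this subset, the two Neisseria strains are each other's nearest neighbours, as are the two Helicobacter strains. Extending the dictionary to all rows of Table I, or to three-dimensional points, requires no change to the function.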

From Table I, we can see that all Archaebacteria belong to the same class except M.
jannaschii, and the four Chlamydia lie close together. It is surprising that the closest
pairs of bacteria, Helicobacter pylori J99 and Helicobacter pylori 26695, and Neisseria
meningitidis MC58 and Neisseria meningitidis Z2491, group with each other. The two
hyperthermophilic bacteria group with each other and are linked with the Archaebacteria.
It has previously been shown that Aquifex has a close relationship with Archaebacteria,
from the gene comparison of an enzyme needed for the synthesis of the amino acid
tryptophan and from the length sequence of the complete genome. In general, bacteria that
are close phylogenetically are mostly close in the spaces $(D_{-1}, D_1)$ and
$(D_{-1}, D_1, D_{-2})$.

TABLE I. The values of $D_{-1}$, $D_1$, $D_{-2}$ and $D_2$ of all bacteria selected.

Species                       Category                    $D_{-1}$   $D_1$      $D_{-2}$   $D_2$
Xylella fastidiosa            Proteobacteria              1.023935   0.9734505  1.046237   0.9434007
Treponema pallidum            Spirochaete                 1.024096   0.9744529  1.048537   0.9456879
Vibrio cholerae               Proteobacteria              1.027849   0.9754193  1.060974   0.9529402
Bacillus subtilis             Gram-positive Eubacteria    1.031173   0.9691831  1.062364   0.9392986
Chlamydia trachomatis         Chlamydia                   1.031900   0.9705723  1.067158   0.9421241
Chlamydia pneumoniae          Chlamydia                   1.034190   0.9691189  1.075935   0.9396138
Rhizobium sp. NGR234          Proteobacteria              1.034821   0.9689233  1.068532   0.9430141
Chlamydia muridarum           Chlamydia                   1.036608   0.9646960  1.075166   0.9293640
Chlamydia pneumoniae AR39     Chlamydia                   1.037127   0.9593074  1.078164   0.9106171
Pyrococcus abyssi             Archaebacteria              1.038142   0.9683081  1.091387   0.9393384
Aeropyrum pernix              Archaebacteria              1.040248   0.9535630  1.074807   0.9033159
Synechocystis sp. PCC6803     Cyanobacteria               1.045674   0.9657137  1.127265   0.9364141
Mycoplasma pneumoniae         Gram-positive Eubacteria    1.046260   0.9584649  1.092869   0.9250106
Archaeoglobus fulgidus        Archaebacteria              1.047071   0.9631252  1.130371   0.9279480
Escherichia coli              Proteobacteria              1.047849   0.9711645  1.174754   0.9474317
M. thermoautotrophicum        Archaebacteria              1.048569   0.9626480  1.116451   0.9306760
Thermotoga maritima           Hyperthermophilic bacteria  1.053824   0.9545637  1.145209   0.9101596
Aquifex aeolicus              Hyperthermophilic bacteria  1.055210   0.9540893  1.134702   0.9145361
Pyrococcus horikoshii         Archaebacteria              1.056144   0.9587924  1.139402   0.9237674
Neisseria meningitidis MC58   Proteobacteria              1.058779   0.9522681  1.132902   0.9132383
Neisseria meningitidis Z2491  Proteobacteria              1.058805   0.9497503  1.133201   0.9065167
M. tuberculosis               Gram-positive Eubacteria    1.061496   0.9410341  1.115466   0.8920540
Haemophilus influenzae        Proteobacteria              1.062565   0.9511231  1.147970   0.9122260
Buchnera sp. APS              Proteobacteria              1.085581   0.8955851  1.152650   0.7904221
Rickettsia prowazekii         Proteobacteria              1.088237   0.9192655  1.173883   0.8567044
Pseudomonas aeruginosa        Proteobacteria              1.109776   0.9154980  1.187378   0.8622321
Borrelia burgdorferi          Spirochaete                 1.111380   0.9030539  1.261299   0.8298323
Campylobacter jejuni          Proteobacteria              1.123096   0.9053437  1.279505   0.8349793
Ureaplasma urealyticum        Gram-positive bacteria      1.124616   0.8843481  1.260287   0.8065916
Helicobacter pylori J99       Proteobacteria              1.128590   0.9299614  1.390791   0.8758443
Helicobacter pylori 26695     Proteobacteria              1.149943   0.9276062  1.460757   0.8719445
Mycoplasma genitalium         Gram-positive Eubacteria    1.160435   0.9142718  1.365716   0.8631789
Methanococcus jannaschii      Archaebacteria              1.165208   0.9113731  1.349664   0.8628226

FIG. 1. Histograms of substrings with different lengths

FIG. 6. "Analogous" specific heat of measures of substrings with different lengths K in Chromosome 22 of Homo sapiens.

FIG. 7. Dimension spectra of Chromosome 22 of Homo sapiens, Chromosome 2 of A. thaliana, Chromosome 3 of P. falciparum, Chromosome 1 of C. elegans, Chromosome 15 of S. cerevisiae and M. genitalium.

FIG. 8. "Analogous" specific heat of Chromosome 22 of Homo sapiens, Chromosome 2 of A. thaliana, Chromosome 3 of P. falciparum, Chromosome 1 of C. elegans, Chromosome 15 of S. cerevisiae, M. genitalium and complete random sequence.